A column-oriented DBMS is a database management system (DBMS) that stores its content by column rather than by row. This has advantages for data warehouses and library catalogues where aggregates are computed over large numbers of similar data items.
It is possible to achieve some benefits of column-oriented and row-oriented organization with any database. By denoting one as column-oriented we are referring to both the ease of expression of a column-oriented structure and the focus on optimizations for column-oriented workloads.[1][2] This approach is in contrast to row-oriented or row store databases and with correlation databases, which use a value-based storage structure.
Contents |
A relational database program must show its data as two-dimensional tables, of columns and rows, but store it as one-dimensional strings. For example, a database might have this table.
EmpId | Lastname | Firstname | Salary |
---|---|---|---|
1 | Smith | Joe | 40000 |
2 | Jones | Mary | 50000 |
3 | Johnson | Cathy | 44000 |
This simple table includes an employee identifier (EmpId), name fields (Lastname and Firstname) and a salary (Salary).
This table exists in the computer's memory (RAM) and storage (hard drive). Although RAM and hard drives differ mechanically, the computer's operating system abstracts them. Still, the database must coax its two-dimensional table into a one-dimensional series of bytes, for the operating system to write to either the RAM, or hard drive, or both.
A row-oriented database serializes all of the values in a row together, then the values in the next row, and so on.
1,Smith,Joe,40000; 2,Jones,Mary,50000; 3,Johnson,Cathy,44000;
A column-oriented database serializes all of the values of a column together, then the values of the next column, and so on.
1,2,3; Smith,Jones,Johnson; Joe,Mary,Cathy; 40000,50000,44000;
This is a simplification. Partitioning, indexing, caching, views, OLAP cubes, and transactional systems such as write ahead logging or multiversion concurrency control all dramatically affect the physical organization. That said, online transaction processing (OLTP)-focused RDBMS systems are more row-oriented, while online analytical processing (OLAP)-focused systems are a balance of row-oriented and column-oriented.
Comparisons between row-oriented and column-oriented systems are typically concerned with the efficiency of hard-disk access for a given workload, as seek time is incredibly long compared to the other delays in computers. Sometimes, reading a megabyte of sequentially stored data takes no more time than one random access.[3] Further, because seek time is improving at a slow rate relative to CPU power (see Moore's Law), this focus will likely continue on systems reliant on hard-disks for storage. Following is a set of over-simplified observations which attempt to paint a picture of the trade-offs between column- and row-oriented organizations. Unless, of course, the application can be reasonably assured to fit most/all data into memory, in which case huge optimizations are available from in-memory database systems.
In practice, row-oriented architectures are well-suited for OLTP-like workloads which are more heavily loaded with interactive transactions. Column stores are well-suited for OLAP-like workloads (e.g., data warehouses) which typically involve a smaller number of highly complex queries over all data (possibly terabytes). However, there are a number of proven row-based OLAP RDBMS that handles terabytes, or even petabytes of data, such as Teradata.
Column data is of uniform type; therefore, there are some opportunities for storage size optimizations available in column-oriented data that are not available in row-oriented data. For example, many popular modern compression schemes, such as LZW or run-length encoding, make use of the similarity of adjacent data to compress. While the same techniques may be used on row-oriented data, a typical implementation will achieve less effective results.[4][5]
To improve compression, several implementations (such as Vertica) sort the rows. For example, using bitmap indexes, sorting can improve compression by an order of magnitude.[6] To maximize the compression benefits of the lexicographical order with respect to run-length encoding, it is best to use low-cardinality columns as the first sort keys.[7] For example, given a table with columns sex, age, name, it would be best to sort first on the value sex (cardinality of two), then age (cardinality of <150), then name.
Columnar compression achieves a reduction in disk space at the expense of efficiency of retrieval. Retrieving all data from a single row is more efficient when that data is located in a single location, such as in a row-oriented architecture. Further, the greater adjacent compression achieved, the more difficult random-access may become, as data might need to be uncompressed to be read. Therefore, column-oriented architectures are sometimes enriched by additional mechanisms aimed at minimizing the need for access to compressed data.[8]
Column stores or transposed files have been implemented from the early days of DBMS development, beginning in the 1970s. For example, Statistics Canada implemented the RAPID system[9] in 1976 and used it for processing and retrieval of the Canadian Census of Population and Housing as well as several other statistical applications. RAPID was shared with other statistical organizations throughout the world and used widely in the 1980s. It continued to be used by Statistics Canada until the 1990s.
For many years, only the Sybase IQ product was commercially available in the column-oriented DBMS class. However, that has changed rapidly in the last few years with many open source and commercial implementations.
Current examples of column-oriented DBMSs include:
|